Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law Based on the course RNAseq analysis in R delivered on May 11/12th 2016

Before starting this section, we will make sure we have all the relevant objects from the Differential Expression analysis present.

suppressPackageStartupMessages(library(edgeR))
load("Robjects/DE.Rdata")

Overview

  • Visualising DE results
  • Getting annotation
  • Retrieving gene models
  • Exporting browser traecks
  • Visualising results with respect to genomic location

Adding annotation and saving the results

We have a list of significantly differentially expressed genes, but the only annotation we can see is the Entrez Gene ID, which is not very informative.

results <- as.data.frame(topTags(lrt.BvsL,n = Inf))
results

edgeR provides a function plotSmear that allows us to visualise the results of a DE analysis. In a similar manner to the MA-plot for microarray data, this plot shows the log-fold change against log-counts per million, with DE genes highlighted:

summary(de <- decideTestsDGE(lrt.BvsL))
   [,1]
-1 2404
0  9583
1  1322
detags <- rownames(dgeObj)[as.logical(de)]
plotSmear(lrt.BvsL, de.tags=detags)

However, on such a plot it would be nice to add labels to highlight the genes with most evidence for being DE, or our favourite genes. To perform such a task we need to map between the identifiers we have in the edgeR output and more familiar names.

There are a number of ways to add annotation, but we will demonstrate how to do this using the org.Mm.eg.db package. This package is one of several organism-level packages which are re-built every 6 months. These packages are listed on the annotation section of the Bioconductor, and are installed in the same way as regular Bioconductor packages. An alternative approach is to use biomaRt, an interface to the BioMart resource. BioMart is much more comprehensive, but the organism package fit better into the Bioconductor workflow.

source("http://www.bioconductor.org/biocLite.R")
biocLite("org.Mm.eg.db")
# For Human
biocLite("org.Hs.eg.db")

The packages are larger in size that Bioconductor software pacakges, but essentially they are databases that can be used to make offline queries.

library(org.Mm.eg.db)
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap,
    parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:limma’:

    plotMA

The following objects are masked from ‘package:stats’:

    IQR, mad, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colnames, do.call, duplicated, eval,
    evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, lengths,
    Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position,
    rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, which.min

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with 'browseVignettes()'. To cite
    Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    colMeans, colSums, expand.grid, rowMeans, rowSums

First we need to decide what information we want. In order to see what we can extract we can run the columns function on the annotation database.

columns(org.Mm.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GO"           "GOALL"        "IPI"          "MGI"          "ONTOLOGY"    
[16] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"     
[21] "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"     

We are going to filter the database by a key or set of keys in order to extract the information we want. Valid names for the key can be retrieved with the keytypes function.

keytypes(org.Mm.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"    
 [7] "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
[13] "IPI"          "MGI"          "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[19] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"     

We should see ENTREZID, which is the type of key we are going to use in this case. If we are unsure what values are acceptable for the key, we can check what keys are valid with keys

keys(org.Mm.eg.db, keytype="ENTREZID")[1:10]
 [1] "11287" "11298" "11302" "11303" "11304" "11305" "11306" "11307" "11308" "11350"

It is a useful sanity check to make sure that the keys you want to use are all valid. We could use %in% in this case.

## Build up the query step-by-step
my.keys <- c("50916", "110308","12293")
my.keys %in% keys(org.Mm.eg.db, keytype="ENTREZID")
[1] TRUE TRUE TRUE
all(my.keys %in% keys(org.Mm.eg.db, keytype="ENTREZID"))
[1] TRUE

Let’s build up the query step by step.

## to be filled-in interactively during the class.
select(org.Mm.eg.db,

To annotate our results, we definitely want gene symbols and perhaps the full gene name. Let’s build up our annotation information in a separate data frame using the select function.

ann <- select(org.Mm.eg.db,keys=rownames(results),columns=c("ENTREZID","SYMBOL","GENENAME"))
'select()' returned 1:1 mapping between keys and columns
# Have a look at the annotation
ann

Let’s double check that the ENTREZID column matches exactly to our results rownames.

table(ann$ENTREZID==rownames(results))

 TRUE 
13309 

We can bind in the annotation information to the results data frame. (Please note that if the select function returns a 1:many mapping then you can’t just append the annotation to the fit object.)

results.annotated <- cbind(results, ann)
results.annotated

We can save the results table using the write.csv function, which writes the results out to a csv file that you can open in excel.

write.csv(results.annotated,file="B.PregVsLacResults.csv",row.names=FALSE)

A note about deciding how many genes are significant: In order to decide which genes are differentially expressed, we usually take a cut-off of 0.05 on the adjusted p-value, NOT the raw p-value. This is because we are testing more than 15000 genes, and the chances of finding differentially expressed genes is very high when you do that many tests. Hence we need to control the false discovery rate, which is the adjusted p-value column in the results table. What this means is that if 100 genes are significant at a 5% false discovery rate, we are willing to accept that 5 will be false positives. Note that the decideTests function displays significant genes at 5% FDR.

Challenge

Re-visit the plotSmear plot from above and use the text function to add labels for the names of the top 200 most DE genes

Another common visualisation is the volcano plot which display a measure of significance on the y-axis and fold-change on the x-axis.

signif <- -log10(results.annotated$FDR)
plot(results.annotated$logFC,signif,pch=16)
points(results.annotated[detags,"logFC"],-log10(results.annotated[detags,"FDR"]),pch=16,col="red")

Before following up on the DE genes with further lab work, a recommended sanity check is to have a look at the expression levels of the individual samples for the genes of interest. We can quickly look at grouped expression using stripchart. We can use the normalised log expression values in the dgeCounts object (dgeCounts$counts).

library(RColorBrewer)
par(mfrow=c(1,3))
normCounts <- dgeObj$counts
# Let's look at the first gene in the topTable, Irx4, which has a rowname 50916
stripchart(normCounts["50916",]~group)
# This plot is ugly, let's make it better
stripchart(normCounts["50916",]~group,vertical=TRUE,las=2,cex.axis=0.8,pch=16,col=1:6,method="jitter")
# Let's use nicer colours
nice.col <- brewer.pal(6,name="Dark2")
stripchart(normCounts["50916",]~group,vertical=TRUE,las=2,cex.axis=0.8,pch=16,cex=1.3,col=nice.col,method="jitter",ylab="Normalised log2 expression",main=" Irx4")

An interactive version of the volcano plot above that includes the raw per sample values in a separate panel is possible via the glXYPlot function in the Glimma package.

library(Glimma)
group2 <- group
levels(group2) <- c("basal.lactate","basal.preg","basal.virgin","lum.lactate", "lum.preg", "lum.virgin")
glXYPlot(x=results$logFC, y=-log10(results$FDR),
         xlab="logFC", ylab="B", main="B.PregVsLac",
         counts=normCounts, groups=group2, status=de,
         anno=ann, id.column="ENTREZID", folder="volcano")

This function creates an html page (./volcano/XY-Plot.html) with a volcano plot on the left and a plot showing the log-CPM per sample for a selected gene on the right. A search bar is available to search for genes of interest.

Retrieving Genomic Locations

It might seem natural to add genomic locations to our annotation table, and possibly a bit odd that the org.Mm.eg.db package does not supply such mappings. In fact, there is a whole suite of package for performing this, and more-advanced queries. These are listed on the Bioconductor annotation page and have the prefix TxDb.

The package we will be using is TxDb.Mmusculus.UCSC.mm10.knownGene. Packages are available for other organisms and genome builds. It is even possible to build your own database if one does not exist. See vignette("GenomicFeatures") for details

source("http://www.bioconductor.org/biocLite.R")
biocLite("TxDb.Mmusculus.UCSC.mm10.knownGene")

## For Humans
biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")

We load the library in the usual fashion and create a new object to save some typing. As with the org. packages, we can query what columns are available with columns,

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
Loading required package: GenomicFeatures
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
tx <- TxDb.Mmusculus.UCSC.mm10.knownGene
columns(tx)
 [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"   "CDSSTRAND" 
 [7] "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"   "EXONRANK"   "EXONSTART" 
[13] "EXONSTRAND" "GENEID"     "TXCHROM"    "TXEND"      "TXID"       "TXNAME"    
[19] "TXSTART"    "TXSTRAND"   "TXTYPE"    

The select function is used in the same manner as the org.Mm.eg.db packages.

keys <- c("12992", "13358","20531")
select(tx, keys=keys,
       keytype = "GENEID",
       columns=c("EXONID","EXONSTART","EXONEND"))
'select()' returned 1:many mapping between keys and columns

One of the real strengths of the txdb.. packages is the ability of interface with GenomicRanges, which is the object type used throughout Bioconductor to manipulate Genomic Intervals.

These object types permit us to perform common operations on intervals such as overlapping and counting.

library(GenomicRanges)
my.range <-GRanges("chr1", IRanges(start=1000,end=2000))
my.range
GRanges object with 1 range and 0 metadata columns:
      seqnames       ranges strand
         <Rle>    <IRanges>  <Rle>
  [1]     chr1 [1000, 2000]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Let’s try some dummy intervals with random chromsome assignments and start positions.

set.seed(10042017)
chrs <- sample(paste0("chr",1:19),100,replace=TRUE)
startPos <- sample(1:5e6,100,replace=TRUE)
my.rangesA <- GRanges(chrs, IRanges(start=startPos,width=1000))
my.rangesA
GRanges object with 100 ranges and 0 metadata columns:
        seqnames             ranges strand
           <Rle>          <IRanges>  <Rle>
    [1]    chr19 [1534111, 1535110]      *
    [2]    chr10 [4608224, 4609223]      *
    [3]     chr7 [2546454, 2547453]      *
    [4]     chr7 [3599137, 3600136]      *
    [5]    chr18 [ 895462,  896461]      *
    ...      ...                ...    ...
   [96]    chr12 [3244195, 3245194]      *
   [97]    chr11 [3320136, 3321135]      *
   [98]    chr11 [ 418010,  419009]      *
   [99]    chr12 [1351347, 1352346]      *
  [100]     chr3 [2710560, 2711559]      *
  -------
  seqinfo: 19 sequences from an unspecified genome; no seqlengths

There are a number of useful functions for calculating properties of the data (such as coverage or sorting). It is even possible to convert between different chromosome naming conventions.

width(my.rangesA)
  [1] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
 [17] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
 [33] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
 [49] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
 [65] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
 [81] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
 [97] 1000 1000 1000 1000
sort(my.rangesA)
GRanges object with 100 ranges and 0 metadata columns:
        seqnames             ranges strand
           <Rle>          <IRanges>  <Rle>
    [1]    chr19 [ 496023,  497022]      *
    [2]    chr19 [1534111, 1535110]      *
    [3]    chr19 [2444002, 2445001]      *
    [4]    chr19 [3406152, 3407151]      *
    [5]    chr10 [  63660,   64659]      *
    ...      ...                ...    ...
   [96]     chr4 [4437249, 4438248]      *
   [97]     chr4 [4853087, 4854086]      *
   [98]     chr3 [2710560, 2711559]      *
   [99]     chr3 [2867165, 2868164]      *
  [100]    chr13 [2362583, 2363582]      *
  -------
  seqinfo: 19 sequences from an unspecified genome; no seqlengths
coverage(my.rangesA)
RleList of length 19
$chr19
integer-Rle of length 3407151 with 8 runs
  Lengths:  496022    1000 1037088    1000  908891    1000  961150    1000
  Values :       0       1       0       1       0       1       0       1

$chr10
integer-Rle of length 4609223 with 12 runs
  Lengths:   63659    1000 1012827    1000  959543 ...  860029    1000  405082    1000
  Values :       0       1       0       1       0 ...       0       1       0       1

$chr7
integer-Rle of length 4627728 with 16 runs
  Lengths:   63290    1000 1212862    1000  493363 ...  630989    1000 1026592    1000
  Values :       0       1       0       1       0 ...       0       1       0       1

$chr18
integer-Rle of length 4436160 with 14 runs
  Lengths:  160693    1000  396731    1000  336037 ... 1006023    1000   87480    1000
  Values :       0       1       0       1       0 ...       0       1       0       1

$chr11
integer-Rle of length 4755123 with 14 runs
  Lengths:  418009    1000  455255    1000  943607 ... 1307517    1000  124471    1000
  Values :       0       1       0       1       0 ...       0       1       0       1

...
<14 more elements>
seqlevelsStyle(my.rangesA)
[1] "UCSC"
keepSeqlevels(my.rangesA,"chr19")
GRanges object with 4 ranges and 0 metadata columns:
      seqnames             ranges strand
         <Rle>          <IRanges>  <Rle>
  [1]    chr19 [1534111, 1535110]      *
  [2]    chr19 [2444002, 2445001]      *
  [3]    chr19 [ 496023,  497022]      *
  [4]    chr19 [3406152, 3407151]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
#seqlevelsStyle(my.rangesA) <- "Ensembl"
#my.rangesA
my.rangesB <- shift(my.rangesA, shift = 500)
my.rangesB <- resize(my.rangesB,width =sample(1:1000,100,replace = TRUE))
my.rangesB
GRanges object with 100 ranges and 0 metadata columns:
        seqnames             ranges strand
           <Rle>          <IRanges>  <Rle>
    [1]    chr19 [1534611, 1535582]      *
    [2]    chr10 [4608724, 4609486]      *
    [3]     chr7 [2546954, 2546968]      *
    [4]     chr7 [3599637, 3599782]      *
    [5]    chr18 [ 895962,  896517]      *
    ...      ...                ...    ...
   [96]    chr12 [3244695, 3244941]      *
   [97]    chr11 [3320636, 3321524]      *
   [98]    chr11 [ 418510,  419236]      *
   [99]    chr12 [1351847, 1352549]      *
  [100]     chr3 [2711060, 2711736]      *
  -------
  seqinfo: 19 sequences from an unspecified genome; no seqlengths

It is quite straightforward to translate the output of a select query into a GenomicFeatures object. However, several convenience functions exist to retrieve the structure of every gene for a given organism in one object.

The output of exonsBy is a list, where each item in the list is the exon co-ordinates of a particular gene.

exo <- exonsBy(tx,"gene")
exo
GRangesList object of length 24116:
$100009600 
GRanges object with 7 ranges and 2 metadata columns:
      seqnames               ranges strand |   exon_id   exon_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]     chr9 [21062393, 21062717]      - |    134539        <NA>
  [2]     chr9 [21062894, 21062987]      - |    134540        <NA>
  [3]     chr9 [21063314, 21063396]      - |    134541        <NA>
  [4]     chr9 [21066024, 21066377]      - |    134542        <NA>
  [5]     chr9 [21066940, 21067925]      - |    134543        <NA>
  [6]     chr9 [21068030, 21068117]      - |    134544        <NA>
  [7]     chr9 [21073075, 21075496]      - |    134546        <NA>

$100009609 
GRanges object with 6 ranges and 2 metadata columns:
      seqnames               ranges strand | exon_id exon_name
  [1]     chr7 [84940169, 84941088]      - |  109989      <NA>
  [2]     chr7 [84943141, 84943264]      - |  109990      <NA>
  [3]     chr7 [84943504, 84943722]      - |  109991      <NA>
  [4]     chr7 [84946200, 84947000]      - |  109992      <NA>
  [5]     chr7 [84947372, 84947651]      - |  109993      <NA>
  [6]     chr7 [84963816, 84964009]      - |  109994      <NA>

$100009614 
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand | exon_id exon_name
  [1]    chr10 [77711446, 77712009]      + |  143986      <NA>

...
<24113 more elements>
-------
seqinfo: 66 sequences (1 circular) from mm10 genome

To access the structure of a particular gene, we can use the [[ syntax with the name of the gene (Entrez gene ID) within quote marks.

exo[["12992"]]
GRanges object with 14 ranges and 2 metadata columns:
       seqnames               ranges strand |   exon_id   exon_name
          <Rle>            <IRanges>  <Rle> | <integer> <character>
   [1]     chr5 [87808121, 87808165]      + |     68233        <NA>
   [2]     chr5 [87809897, 87809959]      + |     68234        <NA>
   [3]     chr5 [87813088, 87813114]      + |     68235        <NA>
   [4]     chr5 [87813838, 87813861]      + |     68236        <NA>
   [5]     chr5 [87813941, 87813982]      + |     68237        <NA>
   ...      ...                  ...    ... .       ...         ...
  [10]     chr5 [87819066, 87819095]      + |     68242        <NA>
  [11]     chr5 [87819066, 87819110]      + |     68243        <NA>
  [12]     chr5 [87820951, 87820995]      + |     68244        <NA>
  [13]     chr5 [87822212, 87822322]      + |     68245        <NA>
  [14]     chr5 [87824139, 87824421]      + |     68246        <NA>
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

Exporting tracks

It is also possible to save the results of a Bioconductor analysis in a browser to enable interactive analysis and integration with other data types, or sharing with collaborators. For instance, we might want a browser track to indicate where our differentially-expressed genes are located. We shall use the bed format to display these locations.

At the moment, we have a GenomicFeatures object that represents every exon. However, we do not need this level of granularity for the bed output

Select the names of the statistically significant genes from the edgeR output in the usual manner.

tt <- topTable(fit.cont,coef="B.PregVsLac",n=Inf)
Error in is(fit, "MArrayLM") : object 'fit.cont' not found

Use the range function to obtain a single range for every gene and then get the genomic intervals that correspond to these DE genes.

exoRanges <- unlist(range(exo))
sigRegions <- exoRanges[na.omit(match(sigGenes$ENTREZID, names(exoRanges)))]
sigRegions

Rather than just representing the genomic locations, the .bed format is also able to colour each range according to some property of the analysis (e.g. direction and magnitude of change) to help highlight particular regions of interest. A score can also be displayed when a particular region is clicked-on. A useful propery of GenomicRanges is that we can attach metadata to each range using the mcols function. The metadata can be supplied in the form of a data frame.

mcols(sigRegions) <- sigGenes[match(names(sigRegions), sigGenes[,1]),]
sigRegions
seqlevels(sigRegions)
sigRegions <- keepSeqlevels(sigRegions, paste0("chr", c(1:19,"X","Y")))

Create a score from the p-values that will displayed under each region, and colour scheme for the regions based on the fold-change. For convenience, restrict the fold changes to be within the region -3 to 3.

sigRegions
Score <- -log10(sigRegions$adj.P.Val)
rbPal <-colorRampPalette(c("red", "blue"))
logfc <- pmax(sigRegions$logFC, -3)
logfc <- pmin(logfc , 3)
Col <- rbPal(10)[as.numeric(cut(logfc, breaks = 10))]

The colours and score can be saved in the GRanges object and score and itemRgb columns respectively and will be used to construct the track when exporting. The rtracklayer function can be used to import and export browsers tracks.

Export the signifcant results from the DE analysis as a .bed track using rtracklayer. You can load the resulting file in IGV, if you wish.

mcols(sigRegions)$score <- Score
mcols(sigRegions)$itemRgb <- Col
library(rtracklayer)
export(sigRegions , con = "topHits.bed")

Extracting Reads

We haven’t discussed much

library(GenomicAlignments)
generegion <- exo[["12992"]]
getwd()
bam <- readGAlignments(file="bam/MCL1.DG.bam",
                       param=ScanBamParam(which=generegion))
bam

An aside - Acessing Genome Sequence

Brief Introduction to ggplot2

library(ggplot2)

top200 <- tt[1:200,]

ggplot(tt, aes(x = logFC, y=B)) + geom_point(alpha=0.4) + scale_color_manual(values=c("black","red")) + geom_text(data=top200, aes(x=logFC,y=B,label=SYMBOL),col="blue",alpha=0.4)

Challenge

Use ggplot2 to re-construct the MA- plot from above.

Solution

Composing plots with ggbio

We will now take a brief look at one of the visualisation packages in Bioconductor that takes advantage of the GenomicRanges and GenomicFeatures object-types. In this section we will show a worked example of how to combine several types of genomic data on the same plot. The documentation for ggbio is very extensive and contains lots of examples.

http://www.tengfei.name/ggbio/docs/

The Gviz package is another Bioconductor package that specialising in genomic visualisations, but we will not explore this package in the course.

The Manhattan plot is a common way of visualising genome-wide results, and this is implemented as the plotGrandLinear function. We have to supply a value to display on the y-axis, typically this is some measure of significance. Changing this variable displayed on the y-axis is done by using the aes function, which is inherited from ggplot2. The positioning of points on the x-axis is handled automatically by ggbio, using the ranges information to get the genomic coordinates of the ranges of interest.

library(ggbio)
plotGrandLinear(sigRegions , aes(y = score))

As we discussed, an appealing feature of ggplot2 is the ability to map properties of your plot to variables present in your data. For example, we could create a variable to distinguish between up- and down-regulated genes. The variables used for aesthetic mapping must be present in the mcols section of your ranges object.

mcols(sigRegions)$Up <- logfc > 0
plotGrandLinear(sigRegions, aes(y = logFC, col = Up))

A useful function within ggbio is autoplot, which will construct an appropriate plot based on the object-type of the input. For example, ggbio is able to plot the structure of genes according to a particular model represented by a GenomicFeatures object.

autoplot(tx, which=exo[["24117"]])

We can even plot the location of sequencing reads if they have been imported using readGAlignments function (or similar).

myreg <- flank(reduce(exo[["24117"]]), 1000, both = T)
bam <- readGAlignments(file="bam/MCL1.DG.bam",
                       param=ScanBamParam(which=myreg),use.names = TRUE)

autoplot(bam,which=myreg)
autoplot(bam , stat = "coverage")

Like ggplot2, ggbio plots can be saved as objects that can later be modified, or combined together to form more complicated plots. If saved in this way, the plot will only be displayed on a plotting device when we query the object. This strategy is useful when we want to add a common element (such as an ideogram) to a plot composition and don’t want to repeat the code to generate the plot every time.

#idPlot <- plotIdeogram(genome = "mm10")
#idPlot
geneMod <- autoplot(tx, which = myreg)
Error: could not find function "autoplot"
---
title: "RNA-seq Analysis in R"
subtitle: "Annotation and Visualisation of RNA-seq results"
author: "Stephane Ballereau, Mark Dunning, Oscar Rueda, Ashley Sawle"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
  html_notebook:
    toc: yes
    toc_float: yes
  html_document:
    toc: yes
    toc_float: yes
minutes: 300
layout: page
bibliography: ref.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

**Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law**
Based on the course [RNAseq analysis in R](http://combine-australia.github.io/2016-05-11-RNAseq/) delivered on May 11/12th 2016

Before starting this section, we will make sure we have all the relevant objects from the Differential Expression analysis present.

```{r}
suppressPackageStartupMessages(library(edgeR))
load("Robjects/DE.Rdata")
```

# Overview

- Visualising DE results
- Getting annotation
- Retrieving gene models
- Exporting browser traecks
- Visualising results with respect to genomic location

## Adding annotation and saving the results

We have a list of significantly differentially expressed genes, but the only annotation we can see is the Entrez Gene ID, which is not very informative. 
```{r}
results <- as.data.frame(topTags(lrt.BvsL,n = Inf))
results

```

`edgeR` provides a function `plotSmear` that allows us to visualise the results of a DE analysis. In a similar manner to the [*MA-plot* for microarray data](https://en.wikipedia.org/wiki/MA_plot), this plot shows the log-fold change against log-counts per million, with DE genes highlighted:

```{r}
summary(de <- decideTestsDGE(lrt.BvsL))
detags <- rownames(dgeObj)[as.logical(de)]
plotSmear(lrt.BvsL, de.tags=detags)
```
However, on such a plot it would be nice to add labels to highlight the genes with most evidence for being DE, or our favourite genes. To perform such a task we need to map between the identifiers we have in the `edgeR` output and more familiar names.

There are a number of ways to add annotation, but we will demonstrate how to do this using the *org.Mm.eg.db* package. This package is one of several *organism-level* packages which are re-built every 6 months. These packages are listed on the [annotation section](http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData) of the Bioconductor, and are installed in the same way as regular Bioconductor packages. An alternative approach is to use `biomaRt`, an interface to the [BioMart](http://www.biomart.org/) resource. BioMart is much more comprehensive, but the organism package fit better into the Bioconductor workflow.


```{r eval=FALSE}
source("http://www.bioconductor.org/biocLite.R")
biocLite("org.Mm.eg.db")
# For Human
biocLite("org.Hs.eg.db")
```

The packages are larger in size that Bioconductor software pacakges, but essentially they are databases that can be used to make *offline* queries. 

```{r}
library(org.Mm.eg.db)
```


First we need to decide what information we want. In order to see what we can extract we can run the `columns` function on the annotation database.

```{r}
columns(org.Mm.eg.db)
```

We are going to filter the database by a key or set of keys in order to extract the information we want. Valid names for the key can be retrieved with the `keytypes` function.

```{r}
keytypes(org.Mm.eg.db)
```

We should see `ENTREZID`, which is the type of key we are going to use in this case. If we are unsure what values are acceptable for the key, we can check what keys are valid with `keys`

```{r}
keys(org.Mm.eg.db, keytype="ENTREZID")[1:10]
```

It is a useful sanity check to make sure that the keys you want to use are all valid. We could use `%in%` in this case.

```{r}
## Build up the query step-by-step
my.keys <- c("50916", "110308","12293")
my.keys %in% keys(org.Mm.eg.db, keytype="ENTREZID")
all(my.keys %in% keys(org.Mm.eg.db, keytype="ENTREZID"))
```

Let's build up the query step by step.

```{r eval=FALSE}
## to be filled-in interactively during the class.
select(org.Mm.eg.db,


```



To annotate our results, we definitely want gene symbols and perhaps the full gene name. Let's build up our annotation information in a separate data frame using the `select` function.

```{r}
ann <- select(org.Mm.eg.db,keys=rownames(results),columns=c("ENTREZID","SYMBOL","GENENAME"))
# Have a look at the annotation
ann
```

Let's double check that the `ENTREZID` column matches exactly to our `results` rownames.

```{r}
table(ann$ENTREZID==rownames(results))
```

We can bind in the annotation information to the `results` data frame. (Please note that if the `select` function returns a 1:many mapping then you can't just append the annotation to the fit object.)

```{r}
results.annotated <- cbind(results, ann)
results.annotated
```


We can save the results table using the `write.csv` function, which writes the results out to a csv file that you can open in excel.

```{r}
write.csv(results.annotated,file="B.PregVsLacResults.csv",row.names=FALSE)
```

**A note about deciding how many genes are significant**: In order to decide which genes are differentially expressed, we usually take a cut-off of 0.05 on the adjusted p-value, NOT the raw p-value. This is because we are testing more than 15000 genes, and the chances of finding differentially expressed genes is very high when you do that many tests. Hence we need to control the false discovery rate, which is the adjusted p-value column in the results table. What this means is that if 100 genes are significant at a 5\% false discovery rate, we are willing to accept that 5 will be false positives. Note that the `decideTests` function displays significant genes at 5\% FDR.

> ## Challenge {.challenge}
>
> Re-visit the `plotSmear` plot from above and use the `text` function to add labels for the names of the top 200 most DE genes
>

```{r,echo=FALSE,fig.height=5,fig.width=10}

plotSmear(lrt.BvsL, de.tags=detags)

N <- 200

text(results.annotated$logCPM[1:N],results.annotated$logFC[1:N],labels = results.annotated$SYMBOL[1:N],col="blue")
```


Another common visualisation is the [*volcano plot*](https://en.wikipedia.org/wiki/Volcano_plot_(statistics)) which display a measure of significance on the y-axis and fold-change on the x-axis. 

```{r,fig.height=5,fig.width=10}
signif <- -log10(results.annotated$FDR)
plot(results.annotated$logFC,signif,pch=16)
points(results.annotated[detags,"logFC"],-log10(results.annotated[detags,"FDR"]),pch=16,col="red")

```


Before following up on the DE genes with further lab work, a recommended *sanity check* is to have a look at the expression levels of the individual samples for the genes of interest. We can quickly look at grouped expression using `stripchart`. We can use the normalised log expression values in the  `dgeCounts` object (`dgeCounts$counts`).

```{r,fig.width=12,fig.height=5}
library(RColorBrewer)
par(mfrow=c(1,3))
normCounts <- dgeObj$counts
# Let's look at the first gene in the topTable, Irx4, which has a rowname 50916
stripchart(normCounts["50916",]~group)
# This plot is ugly, let's make it better
stripchart(normCounts["50916",]~group,vertical=TRUE,las=2,cex.axis=0.8,pch=16,col=1:6,method="jitter")
# Let's use nicer colours
nice.col <- brewer.pal(6,name="Dark2")
stripchart(normCounts["50916",]~group,vertical=TRUE,las=2,cex.axis=0.8,pch=16,cex=1.3,col=nice.col,method="jitter",ylab="Normalised log2 expression",main="	Irx4")
```

An interactive version of the volcano plot above that includes the raw per sample values in a separate panel is possible via the `glXYPlot` function in the *Glimma* package.


```{r}
library(Glimma)
group2 <- group
levels(group2) <- c("basal.lactate","basal.preg","basal.virgin","lum.lactate", "lum.preg", "lum.virgin")
glXYPlot(x=results$logFC, y=-log10(results$FDR),
         xlab="logFC", ylab="B", main="B.PregVsLac",
         counts=normCounts, groups=group2, status=de,
         anno=ann, id.column="ENTREZID", folder="volcano")
```


This function creates an html page (./volcano/XY-Plot.html) with a volcano plot on the left and a plot showing the log-CPM per sample for a selected gene on the right. A search bar is available to search for genes of interest.



## Retrieving Genomic Locations


It might seem natural to add genomic locations to our annotation table, and possibly a bit odd that the `org.Mm.eg.db` package does not supply such mappings. In fact, there is a whole suite of package for performing this, and more-advanced queries. These are listed on the Bioconductor [annotation page](http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData) and have the prefix `TxDb.`

The package we will be using is `TxDb.Mmusculus.UCSC.mm10.knownGene`. Packages are available for other organisms and genome builds. It is even possible to *build your own database* if one does not exist. See `vignette("GenomicFeatures")` for details

```{r eval=FALSE}
source("http://www.bioconductor.org/biocLite.R")
biocLite("TxDb.Mmusculus.UCSC.mm10.knownGene")

## For Humans
biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")

```

We load the library in the usual fashion and create a new object to save some typing. As with the `org.` packages, we can query what columns are available with `columns`,

```{r}
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
tx <- TxDb.Mmusculus.UCSC.mm10.knownGene
columns(tx)
```

The `select` function is used in the same manner as the `org.Mm.eg.db` packages.

```{r}
keys <- c("12992", "13358","20531")
select(tx, keys=keys,
       keytype = "GENEID",
       columns=c("EXONID","EXONSTART","EXONEND"))

```

One of the real strengths of the `txdb..` packages is the ability of interface with `GenomicRanges`, which is the object type used throughout Bioconductor [to manipulate Genomic Intervals](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3738458/pdf/pcbi.1003118.pdf). 

These object types permit us to perform common operations on intervals such as overlapping and counting.
```{r}
library(GenomicRanges)
my.range <-GRanges("chr1", IRanges(start=1000,end=2000))
my.range

```

Let's try some dummy intervals with random chromsome assignments and start positions. 
```{r}
set.seed(10042017)
chrs <- sample(paste0("chr",1:19),100,replace=TRUE)
startPos <- sample(1:5e6,100,replace=TRUE)
my.rangesA <- GRanges(chrs, IRanges(start=startPos,width=1000))
my.rangesA

```

There are a number of useful functions for calculating properties of the data (such as *coverage* or sorting). It is even possible to convert between different chromosome naming conventions.

```{r}
width(my.rangesA)
sort(my.rangesA)
coverage(my.rangesA)
seqlevelsStyle(my.rangesA)
keepSeqlevels(my.rangesA,"chr19")
#seqlevelsStyle(my.rangesA) <- "Ensembl"
#my.rangesA
```


```{r}
my.rangesB <- shift(my.rangesA, shift = 500)
my.rangesB <- resize(my.rangesB,width =sample(1:1000,100,replace = TRUE))
my.rangesB
```

It is quite straightforward to translate the output of a `select` query into a `GenomicFeatures` object. However, several convenience functions exist to retrieve the structure of every gene for a given organism in one object. 

The output of `exonsBy` is a list, where each item in the list is the exon co-ordinates of a particular gene. 

```{r}
exo <- exonsBy(tx,"gene")
exo
```

To access the structure of a particular gene, we can use the `[[` syntax with the name of the gene (Entrez gene ID) within quote marks.

```{r}
exo[["12992"]]
```


## Exporting tracks

It is also possible to save the results of a Bioconductor analysis in a browser to enable interactive analysis
and integration with other data types, or sharing with collaborators. For instance, we might want a
browser track to indicate where our differentially-expressed genes are located. We shall use the bed
format to display these locations.

At the moment, we have a GenomicFeatures object that represents every exon. However, we do not
need this level of granularity for the bed output

Select the names of the statistically significant genes from the edgeR output in the usual
manner.

```{r}
tt <- topTable(fit.cont,coef="B.PregVsLac",n=Inf)
sigGenes <- tt[tt$adj.P.Val < 0.05,]
sigGenes
```

Use the `range` function to obtain a single range for every gene and then get the genomic intervals that correspond to these DE genes.

```{r}
exoRanges <- unlist(range(exo))
sigRegions <- exoRanges[na.omit(match(sigGenes$ENTREZID, names(exoRanges)))]
sigRegions
```

Rather than just representing the genomic locations, the .bed format is also able to colour each range
according to some property of the analysis (e.g. direction and magnitude of change) to help highlight
particular regions of interest. A score can also be displayed when a particular region is clicked-on.
A useful propery of GenomicRanges is that we can attach metadata to each range using the mcols
function. The metadata can be supplied in the form of a data frame.

```{r}
mcols(sigRegions) <- sigGenes[match(names(sigRegions), sigGenes[,1]),]
sigRegions
```

```{r}
seqlevels(sigRegions)
sigRegions <- keepSeqlevels(sigRegions, paste0("chr", c(1:19,"X","Y")))
```

Create a score from the p-values that will displayed under each region, and colour scheme
for the regions based on the fold-change. For convenience, restrict the fold changes to be within the
region -3 to 3.

```{r}
sigRegions
Score <- -log10(sigRegions$adj.P.Val)
rbPal <-colorRampPalette(c("red", "blue"))
logfc <- pmax(sigRegions$logFC, -3)
logfc <- pmin(logfc , 3)
Col <- rbPal(10)[as.numeric(cut(logfc, breaks = 10))]
```

The colours and score can be saved in the GRanges object and score and itemRgb columns respectively
and will be used to construct the track when exporting. The rtracklayer function can be used to import
and export browsers tracks.

 Export the signifcant results from the DE analysis as a `.bed` track using `rtracklayer`. You
can load the resulting file in IGV, if you wish.
```{r}
mcols(sigRegions)$score <- Score
mcols(sigRegions)$itemRgb <- Col
library(rtracklayer)
export(sigRegions , con = "topHits.bed")
```

## Extracting Reads

We haven't discussed much 

```{r}
library(GenomicAlignments)
```

```{r}
generegion <- exo[["12992"]]
getwd()
bam <- readGAlignments(file="bam/MCL1.DG.bam",
                       param=ScanBamParam(which=generegion))
bam
```



## An aside - Acessing Genome Sequence

```{r}

```

## Brief Introduction to ggplot2


```{r}
library(ggplot2)

top200 <- tt[1:200,]

ggplot(tt, aes(x = logFC, y=B)) + geom_point(alpha=0.4) + scale_color_manual(values=c("black","red")) + geom_text(data=top200, aes(x=logFC,y=B,label=SYMBOL),col="blue",alpha=0.4)

```


> ## Challenge {.challenge}
>
> Use ggplot2 to re-construct the MA- plot from above.
>

**Solution**

```{r,echo=FALSE}
# Soution
tt <- topTable(fit.cont,coef="B.PregVsLac",n=Inf)
ggplot(tt, aes(x = AveExpr, y=logFC,col=adj.P.Val < 0.05)) + geom_point() + scale_color_manual(values=c("black","red")) 

```


## Composing plots with ggbio

We will now take a brief look at one of the visualisation packages in Bioconductor that takes advantage
of the GenomicRanges and GenomicFeatures object-types. In this section we will show a worked
example of how to combine several types of genomic data on the same plot. The documentation for
ggbio is very extensive and contains lots of examples.

http://www.tengfei.name/ggbio/docs/

The Gviz package is another Bioconductor package that specialising in genomic visualisations, but we
will not explore this package in the course.

The Manhattan plot is a common way of visualising genome-wide results, and this is implemented as the
`plotGrandLinear` function. We have to supply a value to display on the y-axis, typically this is some
measure of significance. Changing this variable displayed on the y-axis is done by using the `aes` function,
which is inherited from ggplot2. The positioning of points on the x-axis is handled automatically by
ggbio, using the ranges information to get the genomic coordinates of the ranges of interest.

```{r}
library(ggbio)
plotGrandLinear(sigRegions , aes(y = score))

```

As we discussed, an appealing feature of `ggplot2` is the ability to map properties of your plot to variables present in your data. For example, we could create a variable to distinguish between up- and down-regulated genes. The variables used for aesthetic mapping must be present in the `mcols` section of your ranges object.

```{r}
mcols(sigRegions)$Up <- logfc > 0
plotGrandLinear(sigRegions, aes(y = logFC, col = Up))
```



A useful function within ggbio is `autoplot`, which will construct an appropriate plot based on the
object-type of the input. For example, ggbio is able to plot the structure of genes according to a
particular model represented by a `GenomicFeatures` object.


```{r}
autoplot(tx, which=exo[["24117"]])
```

We can even plot the location of sequencing reads if they have been imported using readGAlignments
function (or similar).

```{r}
myreg <- flank(reduce(exo[["24117"]]), 1000, both = T)
bam <- readGAlignments(file="bam/MCL1.DG.bam",
                       param=ScanBamParam(which=myreg),use.names = TRUE)

autoplot(bam,which=myreg)
```

```{r}
autoplot(bam , stat = "coverage")
```
Like ggplot2, ggbio plots can be saved as objects that can later be modified, or combined together to
form more complicated plots. If saved in this way, the plot will only be displayed on a plotting device
when we query the object. This strategy is useful when we want to add a common element (such as
an ideogram) to a plot composition and don’t want to repeat the code to generate the plot every time.

```{r}
#idPlot <- plotIdeogram(genome = "mm10")
#idPlot
geneMod <- autoplot(tx, which = myreg)
reads1 <- autoplot(bam, stat = "coverage")
tracks(geneMod ,reads1)
```


